Abstract. The possibility of improving on the usual multivariate normal
confidence set was first discussed in Stein (1962). Using the ideas of
shrinkage, through Bayesian and empirical Bayesian arguments, domination
results, both analytic and numerical, have been obtained. Here
we trace some of the developments in confidence set estimation.

Key words and phrases: Stein effect, coverage probability, empirical 
Bayes. 



1. INTRODUCTION 



In estimating a multivariate normal mean, the
usual $p$-dimensional $1-\alpha$ confidence set is

(1) $C^0_x = \{\theta : |\theta - x| \le c\sigma\}$,

where we observe $X = x$, $X$ is a random variable
with a $p$-variate normal distribution with mean $\theta$
and covariance matrix $\sigma^2 I$, $X \sim N(\theta, \sigma^2 I)$, $I$ is the
$p \times p$ identity matrix, and $c^2$ is the upper $\alpha$ cutoff of
a chi-squared distribution, satisfying $P(\chi^2_p \le c^2) = 1-\alpha$.
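As a minimal sketch of how the cutoff in (1) could be computed and the set checked numerically, the following uses only the Python standard library. The function names (`chi2_cdf`, `chi2_cutoff`, `in_usual_set`) are hypothetical helpers introduced here for illustration; the chi-squared distribution function is evaluated with a standard power series, and the cutoff is found by bisection.

```python
import math

def chi2_cdf(x, k):
    """P(chi-square_k <= x), via the power series for the regularized
    lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, z = k / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))

def chi2_cutoff(p, alpha):
    """Return c^2 with P(chi-square_p <= c^2) = 1 - alpha, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, p) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, p) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def in_usual_set(theta, x, sigma=1.0, alpha=0.05):
    """True if theta lies in the usual set {theta : |theta - x| <= c * sigma}."""
    c2 = chi2_cutoff(len(x), alpha)
    dist2 = sum((t - xi) ** 2 for t, xi in zip(theta, x))
    return dist2 <= c2 * sigma ** 2
```

For example, `in_usual_set([0.0] * 5, [1.0] * 5)` asks whether the origin lies in the 95% set centered at the observation $(1, \ldots, 1)$ when $p = 5$ and $\sigma = 1$.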

Although the above formulation looks somewhat 
naive, it is very relevant in applications of the linear 
model, still one of the most widely-used statistical 
models. For such models, typical assumptions lead
to $\hat\beta \sim N(\beta, \sigma^2\Sigma)$, where $\hat\beta$ is the least squares
estimator (and MLE under normality), $\beta$ is the vector
of regression slopes and $\Sigma$ is a known covariance matrix
(typically depending on the design matrix). The
usual confidence set for $\beta$ is

(2) $\{\beta : (\hat\beta - \beta)'\Sigma^{-1}(\hat\beta - \beta) \le c^2\sigma^2\}$.

Letting $x = \Sigma^{-1/2}\hat\beta$ and $\theta = \Sigma^{-1/2}\beta$ reduces (2)
to (1).
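The reduction can be written out in one step. The sketch below assumes that $\Sigma^{-1/2}$ denotes the symmetric inverse square root of $\Sigma$, an assumption not stated explicitly in the text:

```latex
\begin{align*}
(\hat\beta-\beta)'\Sigma^{-1}(\hat\beta-\beta)
  &= \bigl(\Sigma^{-1/2}\hat\beta-\Sigma^{-1/2}\beta\bigr)'
     \bigl(\Sigma^{-1/2}\hat\beta-\Sigma^{-1/2}\beta\bigr)
   = |x-\theta|^2, \\
x = \Sigma^{-1/2}\hat\beta
  &\sim N\bigl(\Sigma^{-1/2}\beta,\ \sigma^2\,\Sigma^{-1/2}\Sigma\,\Sigma^{-1/2}\bigr)
   = N(\theta,\ \sigma^2 I).
\end{align*}
```

So the ellipsoid in (2) is the sphere $|\theta - x| \le c\sigma$ in the transformed coordinates.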

In theoretical investigations of confidence sets and
procedures, we often first take $\sigma^2$ known. When $\sigma^2$ is
unknown, the usual strategy is to replace it by a
standard estimator, such as the sample variance $s^2$.
Under normality, if $s^2$ has $\nu$ degrees of freedom,
then $s^2 \sim \sigma^2\chi^2_\nu$, independent of $\hat\beta$. For example, the
usual $F$ confidence set for the regression parameters
based on a linear model can be reduced to $C^0_x$ with
the usual unbiased estimator $s^2$ substituted for $\sigma^2$.
This is the usual Scheffé confidence set. Unfortunately,
contrary to the point estimation case, there
are few theoretical results for unknown $\sigma^2$. However,
there is continued numerical evidence that the usual
confidence set can be dominated in the unknown
variance case (see, e.g., Casella and Hwang, 1987).
Moreover, Hwang and Ullah (1994) argue that the
domination of the alternative fixed radius confidence
spheres for the unknown $\sigma^2$ case, over Scheffé's set,
holds with a larger shrinkage factor.

Since we are assuming that $\sigma^2$ is known, we take
it equal to 1 and (1) becomes

(3) $C^0 = \{\theta : |\theta - x| \le c\}$.



We now ask whether it is possible
to improve on $C^0$ in the sense of finding a confidence
set $C$ such that, for all $\theta$ and $x$:

(i) $P_\theta(\theta \in C) \ge P_\theta(\theta \in C^0)$;
(ii) volume of $C \le$ volume of $C^0$;

with strict inequality holding in either (i) or (ii) for
a set of $\theta$ or $x$ with positive Lebesgue measure. The
answer to this question may be yes for higher dimensional
cases, as suggested by the work of Stein.
The celebrated work of James and Stein (1961)
shows that the estimator

(4) $\delta^a(x) = \left(1 - \frac{a}{|x|^2}\right)x$

dominates $X$ with respect to squared error loss if
$0 < a < 2(p-2)$, that is,

(5) $E_\theta|\delta^a(X) - \theta|^2 \le E_\theta|X - \theta|^2$ for all $\theta$, with strict inequality for some $\theta$.

In practice, this estimator has the deficiency of a singularity
at 0 in that $\lim_{|x| \to 0}\delta^a(x) = -\infty$. This deficiency
can be corrected with the positive part estimator
(appearing in Baranchik, 1964, and mentioned
as Example 1 in Baranchik, 1970)

(6) $\delta^+(x) = \left(1 - \frac{a}{|x|^2}\right)^+ x$,

where $(b)^+ = \max\{0, b\}$. This estimator actually improves
on $\delta^a(x)$ and is so good that, even though it
was known to be inadmissible, it took 30 years to
find a dominating estimator (Shao and Strawderman,
1994). The removal of the singularity makes $\delta^+(x)$
a more attractive candidate for centering a confidence
set.
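The two estimators in (4) and (6) are simple to compute. The following is a minimal sketch in plain Python; the function names are hypothetical, and the default $a = p - 2$ is an illustrative assumption (any $0 < a < 2(p-2)$ satisfies the dominance condition quoted above).

```python
def james_stein(x, a=None):
    """James-Stein estimator (1 - a/|x|^2) x from (4).

    Assumes x is not the zero vector, where the estimator is singular.
    """
    p = len(x)
    if a is None:
        a = p - 2  # illustrative default, an assumption of this sketch
    norm2 = sum(v * v for v in x)
    factor = 1.0 - a / norm2
    return [factor * v for v in x]

def positive_part(x, a=None):
    """Positive-part estimator (1 - a/|x|^2)^+ x from (6)."""
    p = len(x)
    if a is None:
        a = p - 2  # illustrative default, an assumption of this sketch
    norm2 = sum(v * v for v in x)
    # The truncation at zero removes the singularity at x = 0.
    factor = max(0.0, 1.0 - a / norm2) if norm2 > 0.0 else 0.0
    return [factor * v for v in x]
```

When $|x|^2 < a$ the James-Stein factor is negative and the estimate flips sign, whereas the positive-part version returns the zero vector.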

A simple proof of (5) can be found in Stein (1981); 
see also Lehmann and Casella (1998), Chapter 5. 
Therefore, it seems reasonable to conjecture that we
can use a Stein estimator to dominate the confidence
set $C^0$. Although this turns out to be the case, it is
a very difficult problem.

2. RECENTERING 

Stein (1962) gave heuristic arguments¹ that showed
why recentered sets of the form

(7) $C^+ = \{\theta : |\theta - \delta^+(x)| \le c\}$

would dominate the usual confidence set (3) in the
sense that $P_\theta(\theta \in C^+(X)) \ge P_\theta(\theta \in C^0(X))$ for all $\theta$,
where $X \sim N(\theta, I)$, $p \ge 3$. (Note that this set has
the same volume as $C^0$, but is recentered at $\delta^+$. Dominance
would thus be established if we can show
that $C^+$ has higher coverage probability than $C^0$.)
Stein's argument was heuristic, but Brown (1966)
and Joshi (1967) proved the inadmissibility of $C^0$ if
$p \ge 3$ (without giving an explicit dominating procedure).
Joshi (1969) also showed that $C^0$ was admissible
if $p \le 2$.



¹ Stein's paper must be read carefully to appreciate these
arguments. He uses a large $p$ argument and the fact that $X$
and $X - \theta$ are orthogonal as $p \to \infty$.

The existence results of Brown and Joshi are based
on spheres centered at

(8) $\left(1 - \frac{a}{b + |x|^2}\right)x$

[compare to (6)] where $a$ is made arbitrarily small
and $b$ is made arbitrarily large. But these existence
results fall short of actually exhibiting a confidence
set that dominates $C^0$.

The first analytical and constructive results were
established by (surprise!) Hwang and Casella (1982),
who studied the coverage probability of $C^+$ in (7).
Since $C^+$ and $C^0$ have the same volume, domination
will be established if it can be shown that $C^+$ has
higher coverage probability for every value of $\theta$. It
is easy to establish that:

• $P_\theta(\theta \in C^+(X))$ is only a function of $|\theta|$, the Euclidean
norm of $\theta$, and

• $\lim_{|\theta| \to \infty} P_\theta(\theta \in C^+(X)) = 1-\alpha$, the coverage
probability of $C^0$.

Therefore, to prove the dominance of $C^+$, it is sufficient
to show that the coverage probability is a nonincreasing
function of $|\theta|$. Hwang and Casella (1982)
derived a formula for $(d/d|\theta|)P_\theta(\theta \in C^+(X))$ and
found a constant $a_0$ (independent of $\theta$) such that
if $0 < a < a_0$, $C^+$ dominates $C^0$ in coverage probability
for $p \ge 4$. Using a slightly different method of
proof, Hwang and Casella (1984) extended the dominance
to cover the case $p = 3$. This proof is outlined
in Appendix A. The analytic proof was generalized
to spherically symmetric distributions by Hwang and
Chen (1986).
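The coverage comparison can be explored by simulation. The sketch below estimates the coverage of the usual set and of the recentered set (7) at a given $|\theta|$, using the same draws for both sets. It assumes $p = 5$, the tabulated 95% chi-squared cutoff $c^2 = 11.0705$, and the illustrative constant $a = p - 2$, which is a choice made for this sketch and is not claimed to satisfy the bound $a_0$ of Hwang and Casella (1982). The function name is hypothetical.

```python
import random

def coverage(theta_norm, p=5, a=None, c2=11.0705, reps=20000, seed=0):
    """Monte Carlo coverage of the usual set and of the recentered set (7).

    Returns (coverage of usual set, coverage of recentered set).
    c2 = 11.0705 is the 0.95 chi-squared quantile for p = 5 degrees of freedom.
    """
    if a is None:
        a = p - 2  # illustrative choice, an assumption of this sketch
    rng = random.Random(seed)
    # Coverage depends on theta only through |theta|, so put it on one axis.
    theta = [theta_norm] + [0.0] * (p - 1)
    hit_usual = 0
    hit_plus = 0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        norm2 = sum(v * v for v in x)
        shrink = max(0.0, 1.0 - a / norm2) if norm2 > 0.0 else 0.0
        d_usual = sum((t - v) ** 2 for t, v in zip(theta, x))
        d_plus = sum((t - shrink * v) ** 2 for t, v in zip(theta, x))
        hit_usual += d_usual <= c2
        hit_plus += d_plus <= c2
    return hit_usual / reps, hit_plus / reps
```

Calling `coverage(0.0)` and then `coverage(t)` for increasing `t` gives a rough picture of how the gain in coverage of the recentered set is concentrated near the origin.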

There is an interesting geometrical oddity associated
with the Stein recentered confidence set. To see
this, we first formalize our definitions of confidence
sets. Note that for any confidence set we can speak
of the $x$-section and the $\theta$-section. That is, if we define
a confidence procedure to be a set $C(\theta, x)$ in the
product space $\Theta \times \mathcal{X}$, then:

(1) The $x$-section, $C_x = \{\theta : \theta \in C(\theta, x)\}$, is the confidence
set.

(2) The $\theta$-section, $C_\theta = \{x : x \in C(\theta, x)\}$, is the acceptance
region for the test $H_0 : \{\theta\}$.

We then have the tautology that $\theta \in C_x$ if and only
if $x \in C_\theta$ and, thus, we can evaluate the coverage
probability $P_\theta(\theta \in C_X)$ by computing $P_\theta(X \in C_\theta)$,
which is often a more straightforward calculation.

For the usual confidence set, both $C^0_x$ and $C^0_\theta$ are
spheres, one centered at $x$ and one centered at $\theta$.
Although the confidence set $C^+_x$ is a sphere, the associated
$\theta$-section $C^+_\theta$ is not, and has the shape portrayed
in Figure 1. Notice the flattening of the set
in the side closer to 0 in the direction perpendicular
to $\theta$, and the slight expansion away from 0.
Stein (1962) knew of this flattening phenomenon,
which he noted can be achieved in any fixed direction.
What is interesting is that this reshaping of
the $\theta$-section of the recentered set leads to a set with
higher coverage probability than $C^0$ when $p \ge 3$.

Fig. 1. Two-dimensional representation for $C^+_\theta$ and $C^0_\theta$ for
$|\theta| > c$, where $C^0_\theta$ is the sphere of radius $c$ centered at $\theta$
(shaded). The set $C^+_\theta$ intersects $C^0_\theta$ at points A and B (details
on the points of intersection are in Hwang and Casella,
1982). Note the flattening of $C^+_\theta$ on the side toward the origin
and the decrease in volume over $C^0_\theta$.

3. RECENTERING AND SHRINKING 
THE VOLUME 

The improved confidence sets that we have discussed
thus far have the property that their coverage
probability is uniformly greater than that of $C^0$, but
the infimum of the coverage probability (the confidence
coefficient) is equal to that of $C^0$. For example,
recentered sets such as $C^+$ will present the
same volume and confidence coefficient to an experimenter
so, in practice, the experimenter has not
gained anything. (This is, of course, a fallacy and
a shortcoming of the frequentist inference, which requires
the reporting of the infimum of the coverage
probability.)

However, since the coverage probability of $C^+$ is
uniformly higher than the infimum $\inf_\theta P_\theta(\theta \in C^+) = 1-\alpha$,
it should be possible to reduce the radius of
the recentered set and maintain dominance in coverage
probability.

In this section we describe some approaches to 
constructing improved confidence sets, approaches 



that not only result in a recentering of the usual set, 
but also try to reduce the radius (or, more generally, 
the volume). Some of these constructions are based 
on variations of Bayesian highest posterior density 
regions, and thus share the problem of trying to de- 
scribe exactly what the x-section, the confidence set, 
looks like. Others are more of an empirical Bayes 
approach, and tend to have more transparent geom- 
etry. 

3.1 Reducing the Volume-Bayesian Approaches 

The first attempt at constructing confidence sets
with reduced volume considered sets with the same
coverage probability as $C^0$, but with uniformly smaller
volume. One of the first attempts was that of
Faith (1976), who considered a Bayesian construction
based on a two-stage prior where

$\theta \sim N(0, t^2 I)$, $t^2 \sim$ Inverted Gamma$(a, b)$,

which is similar (but not equal) to the prior used by
Strawderman (1971) in the point estimation problem
(Appendix B). The two-stage prior amounts to
a proper prior with density

$\pi(\theta) \propto (2b + |\theta|^2)^{-(a + p/2)}$,

the multivariate $t$-distribution with $2a$ degrees of freedom.
Faith then derived the Bayes decision against
a linear loss, but modified it to the more explicitly
defined region

$C_F = \left\{\theta : \left[\frac{\exp(c^2)}{\exp(|x - \theta|^2)}\right]^{1/(p+2a)} \ge \frac{2b + |\theta|^2}{2b + |x|^2}\right\}$,

where $c$ is the radius of $C^0$. It may happen that $C_F$
is not convex. However, if $a > -p/2$ and $b > (a + p/2)/8$,
the convexity of $C_F$ was established. Unfortunately,
little else was established except when
$p = 3$ or $p = 5$, where for some ranges of $a$ and $b$ it
was shown that $C_F$ has smaller volume and higher
coverage probability than $C^0$.

Berger (1980) took a different approach. Using
a generalization of Strawderman's prior, he calculated
the posterior mean $\delta_B(x)$ and posterior covariance
matrix $\Sigma_B(x)$ and recommended

$C_B = \{\theta : (\theta - \delta_B(x))'\Sigma_B(x)^{-1}(\theta - \delta_B(x)) \le \chi^2_{p,\alpha}\}$,

where $\chi^2_{p,\alpha}$ is the upper $\alpha$ cutoff point from a chi-square
distribution with $p$ degrees of freedom. The
posterior coverage probability would be exactly $1-\alpha$
if the posterior distribution were normal, but this is
not the case (and the posterior coverage is not the
frequentist coverage). However, Berger was able to
show that his set has very attractive coverage probability
and small expected volume based on partly
analytical and partly numerical evidence.

3.2 Reducing the Volume-Empirical Bayes 
Approaches 

A popular construction procedure for finding good
point estimators is the empirical Bayes approach
(see Lehmann and Casella, 1998, Section 4.6, for an
introduction), which also proves to be a useful tool
in confidence set construction. However, unlike the
point estimation problem, where a direct applica- 
tion of empirical Bayes arguments led to improved 
Stein-type estimators (see, e.g., Efron and Morris, 
1973), in the confidence set problem we find that 
a straightforward implementation of an empirical 
Bayes argument would not result in a 1 — a con- 
fidence set. Modifications are necessary to achieve 
dominance of the usual confidence set. 

Suppose that we begin with a traditional normal
prior at the first stage, and have the model

$X \mid \theta \sim N(\theta, I)$, $\theta \sim N(0, \tau^2 I)$,

which results in the Bayesian Highest Posterior Density
(HPD) region

(9) $C^\pi = \{\theta : |\theta - \delta^\pi(x)|^2 \le c^2 M\}$,

where $M = \tau^2/(\tau^2 + 1)$ and $\delta^\pi(x) = Mx$ is the Bayes
point estimator of $\theta$. This follows from the classical
Bayesian result that $\theta \mid x \sim N(Mx, MI)$.

However, for a fixed value of $\tau$, the set $C^\pi$ cannot
have frequentist coverage probability above $1-\alpha$ for
all values of $\theta$. This is easily seen, as the posterior
coverage is identically $1-\alpha$ for all $x$, and, hence, the
double integral over $x$ and $\theta$ is equal to $1-\alpha$. This
means that the frequentist coverage is either equal
to $1-\alpha$ for all $\theta$, or goes above and below $1-\alpha$.
Since the former case does not hold (check $\theta = 0$ and
a nonzero value), the coverage probability of $C^\pi$ is
not always above $1-\alpha$.
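The "above and below" behavior can be checked by simulation. The sketch below estimates the frequentist coverage of the Bayes set (9) at a given $|\theta|$ for a fixed $\tau^2$. It assumes $p = 5$, $\tau^2 = 1$ and the tabulated cutoff $c^2 = 11.0705$; all of these, and the function name, are illustrative choices.

```python
import random

def bayes_set_coverage(theta_norm, tau2=1.0, p=5, c2=11.0705,
                       reps=20000, seed=0):
    """Monte Carlo frequentist coverage of {theta : |theta - M x|^2 <= c^2 M}.

    M = tau2 / (tau2 + 1); c2 = 11.0705 is the 0.95 chi-squared quantile
    for p = 5 degrees of freedom.
    """
    m = tau2 / (tau2 + 1.0)
    rng = random.Random(seed)
    # By spherical symmetry the coverage depends on theta only through |theta|.
    theta = [theta_norm] + [0.0] * (p - 1)
    hits = 0
    for _ in range(reps):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        d2 = sum((t - m * v) ** 2 for t, v in zip(theta, x))
        hits += d2 <= c2 * m
    return hits / reps
```

With these settings the estimated coverage is well above 0.95 at $\theta = 0$ and well below it at $|\theta| = 5$, in line with the argument above.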

Consequently, if we take a naive approach and
replace $\tau^2$ by a reasonable estimate, an empirical
Bayes approach, we cannot expect that such a set
would maintain frequentist coverage above $1-\alpha$.
This is because such a set would have coverage probabilities
converging to those of $C^\pi$ (as the sample
size increases) and, hence, such an empirical Bayes
set would inherit the poor coverage probability of $C^\pi$.
This phenomenon has been documented in Casella
and Hwang (1983).



As an alternative to the naive empirical Bayes approach,
consider a decision-theoretic approach with
a loss function to measure the loss of estimating the
parameter $\theta$ with the set $C$:

(10) $L(\theta, C) = k\,\mathrm{vol}(C) - I(\theta \in C)$,

where $k$ is a constant, $\mathrm{vol}(C)$ is the volume of the
set $C$, and $I(\cdot)$ is the indicator function. Starting
with a prior distribution $\pi(\theta)$, the Bayes rule against
$L(\theta, C)$ is the set

(11) $\{\theta : \pi(\theta \mid x) \ge k\}$,

where $\pi(\theta \mid x)$ is the posterior density. This is
a highest posterior density (HPD) region.

The choice of $k$ is somewhat critical, and we chose
it to coincide with properties of $C^0$. Specifically, if
we choose $k = \exp(-c^2/2)/(2\pi)^{p/2}$, then $C^0$ is minimax
for the loss (10). An alternative explanation
of this choice of $k$ is based on the reasoning that
as $\tau \to \infty$, (11) would converge to $C^0$, which ensures
that the alternative intervals would not become
inferior to $C^0$ for large $\tau^2$. (See He, 1992;
Qiu and Hwang, 2007; and Hwang, Qiu and Zhao,
2009.) Applying this choice of $k$ with the normal
prior $\theta \sim N(0, \tau^2 I)$ yields the Bayes set

$C^\pi_k = \{\theta : |\theta - \delta^\pi(x)|^2 \le M[c^2 - p\log M]\}$,

where $\delta^\pi(x)$ and $M$ are as in (9). By estimating
the hyperparameters, this is then converted to an
empirical Bayes set

$C^E = \{\theta : |\theta - \delta^+(x)|^2 \le v_E(x)\}$,

where $\delta^+(x)$ is the positive part estimator of (6),
and $v_E(x)$ is given by

(12) $v_E(x) = \left(1 - \frac{p-2}{\max(|x|^2, c^2)}\right)\left[c^2 - p\log\left(1 - \frac{p-2}{\max(|x|^2, c^2)}\right)\right]$.

When $c^2 > p$, a minor condition requiring $1-\alpha > 0.55$,
$M[c^2 - p\log M] \uparrow c^2$ as $M \uparrow 1$ (that is, as $\tau \to \infty$). It also follows
that $v_E(x)$ is bounded away from zero. This is
important in maintaining coverage probability. Extensive
numerical evidence was given (Casella and
Hwang, 1983) to support the claim that $C^E$ is a uniform
improvement over $C^0$.
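The squared radii of the Bayes set and of its empirical Bayes version can be computed directly. The sketch below follows the reading of (12) in which $M$ is estimated by $1 - (p-2)/\max(|x|^2, c^2)$; the function names are hypothetical, and the code assumes $c^2 > p$ so that the estimated $M$ is strictly positive.

```python
import math

def bayes_radius2(tau2, p, c2):
    """Squared radius M [c^2 - p log M] of the Bayes set, M = tau2/(tau2+1)."""
    m = tau2 / (tau2 + 1.0)
    return m * (c2 - p * math.log(m))

def eb_radius2(x, c2):
    """Squared radius v_E(x) of the empirical Bayes set, as read from (12).

    Assumes c2 > p = len(x), which keeps the estimated M in (0, 1).
    """
    p = len(x)
    norm2 = sum(v * v for v in x)
    m_hat = 1.0 - (p - 2) / max(norm2, c2)
    return m_hat * (c2 - p * math.log(m_hat))
```

Since $M[c^2 - p\log M]$ is increasing in $M$ on $(0, 1)$ when $c^2 > p$, the value returned by `eb_radius2` is smallest when $|x|^2 \le c^2$ and approaches $c^2$ as $|x|$ grows.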

Confidence sets with exact $1-\alpha$ coverage probability,
with uniformly smaller volume, have also been
constructed by Tseng and Brown (1997), adapting
results from Brown et al. (1995). These confidence
sets are shown, numerically, to typically have smaller
volume than those of Berger (1980).

Brown et al. (1995), working on the problem of
bioequivalence, start with the inversion of an $\alpha$-level
test and derive a $1-\alpha$ confidence interval that minimizes
a Bayes expected volume, that is, the volume
averaged with respect to both $x$ and $\theta$. Tseng and
Brown (1997), using a normal prior $\theta \sim N(0, \tau^2 I)$,
show that the corresponding set of Brown et al.
(1995) becomes

$\left\{\theta : \left|x - \theta\left(1 + \frac{1}{\tau^2}\right)\right|^2 \le k\left(\frac{|\theta|}{\tau^2}\right)\right\}$,

where $k(\cdot)$ is chosen so that this set has exactly $1-\alpha$
coverage probability for every $\theta$. A simple calculation
shows that the squared term in the set has a noncentral
chi squared distribution, so $k(\cdot)$ is the appropriate
$\alpha$ cutoff point. In doing this, Tseng and
Brown avoided the problem of Casella and Hwang
(1983), and the radius does not need to be truncated.

Of course, to be usable, we must estimate $\tau^2$. The
typical empirical Bayes approach would be to replace
$\tau^2$ with an estimate, a function of $x$. However,
Tseng and Brown take a different approach and replace
$\tau^2$ with a function of $\theta$, thereby maintaining
the $1-\alpha$ coverage probability. They argue that $\theta$
is more directly related to $\tau$ than is $x$, and should
provide a better "estimator." Examples of this approach
are discussed in Hwang (1995) and Huwang
(1996).

The set proposed by Tseng and Brown is

$C^{TB} = \left\{\theta : \left|x - \theta\left(1 + \frac{1}{A + B|\theta|^2}\right)\right|^2 \le k\left(\frac{|\theta|}{A + B|\theta|^2}\right)\right\}$

for constants $A > 0$ and $B > 0$, and has coverage exactly
equal to $1-\alpha$ for every $\theta$. Combining analytical
results and numerical calculations, these sets are
shown to have uniformly smaller volume than $C^0$.
Moreover, Tseng and Brown also demonstrate volume
reductions over the sets of Berger (1980) and
Casella and Hwang (1983). The only quibble with
their approach is that the exact form of the set is
not explicit, and can only be solved numerically.

3.3 Reducing Volume and Increasing Coverage 

The first confidence set analytically proven to have
smaller volume and higher coverage than $C^0$ is that
of Shinozaki (1989). Shinozaki worked with the $x$-section
of the confidence set, starting with the set $C^0$.
Consider Figure 1, but drawn as the $x$-section centered
at $x$. By shrinking $C^0$ toward the origin, he was
able to construct a new set with the same coverage
probability as $C^0$ but smaller volume. These sets
can have a substantial improvement over $C^0$, but
smaller improvements compared to Berger (1980)
and Casella and Hwang (1983) (especially when $p$
is large and $|\theta|$ is small). Moreover, there is no point
estimator that is explicitly associated with this set.

3.4 Other Constructions 

Samworth (2005) looked at confidence sets of the
form

$\{\theta : |\theta - \delta^+|^2 \le w_\alpha(\theta)\}$,

where $\delta^+$ is the positive part estimator (6), $w_\alpha(\theta)$
is the appropriate $\alpha$-level cutoff to give the confidence
set coverage probability $1-\alpha$ for all $\theta$, and $X$
has a spherically symmetric distribution. He then
replaced $w_\alpha(\theta)$ by its Taylor expansion

$w_\alpha(\theta) \approx w_\alpha(0) + \frac{1}{2}w''_\alpha(0)|\theta|^2$

and, replacing $\theta$ with $x$, arrived at the confidence
set

$\{\theta : |\theta - \delta^+|^2 \le \min(w_\alpha(0) + \frac{1}{2}w''_\alpha(0)|x|^2,\ c^2)\}$.

Samworth noted the importance of the quantity
$f'(c^2)/f(c^2)$, where $f$ is the density of $x$ (the relative
increasing rate of $f$ at $c^2$). The radius of the
analytic confidence set only depends on the density
through $c^2$ and $f'(c^2)/f(c^2)$. This point was previously
noted by Hwang and Chen (1986) and Robert
and Casella (1990).

This confidence set compares favorably with that 
of Casella and Hwang (1983), having smaller volume 
especially when \x\ is small. Numerical results were 
given not only for the normal distribution, but also 
for other spherically symmetric distributions such 
as the multivariate t and the double exponential. 
Furthermore, a parametric bootstrap confidence set 
is also proposed, which also performs well. 

Efron (2006) studies the problem of confidence set
construction with the goal of minimizing volume.
He ultimately shows that seeking to minimize volume
may not be the best way to improve inferences,
and that relocating the set is more important than
shrinking it. Using a unique construction based on
a polar decomposition of the normal density, Efron
derived a "confidence density" which he used to construct
sets with $1-\alpha$ coverage probability, and ultimately
a minimum volume confidence set with $1-\alpha$
posterior probability.

The confidence density, which plays a large part 
in Efron's paper, is used to show the importance 
of locating the confidence set properly. The sets of 
Tseng and Brown (1997) and Casella and Hwang 
(1983) perform well on this evaluation. A minimum 
volume construction is also derived, and it is shown 
that the resulting set is not optimal in any infer- 
ential sense. Inferential properties, similar to type I 
and type II errors, are explored. It is also seen that 
as the relocated sets decrease volume of the confi- 
dence set, they increase the acceptance regions. 

4. SHRINKING THE VARIANCE 

Thus far, we have only addressed the problem of 
improving confidence regions for the mean. However, 
there is also a Stein effect for the estimation of the 
variance, and this can be exploited to produce im- 
proved confidence intervals for the variance. 

Stein (1964) was the first to notice this (of course!).
Specifically, let $X_1, \ldots, X_n$ be i.i.d. $N(\mu, \sigma^2)$, univariate,
where both $\mu$ and $\sigma$ are unknown, and calculate
$\bar{X} = (1/n)\sum_i X_i$ and $S^2 = \sum_i (X_i - \bar{X})^2$. Against
squared error loss, the best estimator of $\sigma^2$, of the
form $cS^2$, has $c = (n+1)^{-1}$. This is also the best
equivariant estimator [with the location-scale group
and the equivariant loss $(\delta - \sigma^2)^2/\sigma^4$], and is minimax.
Stein showed that the estimator

$\delta^S(\bar{X}, S^2) = h(\bar{X}^2/S^2)\,S^2$,

$h(\bar{X}^2/S^2) = \min\left\{\frac{1}{n+1},\ \frac{1 + n\bar{X}^2/S^2}{n+2}\right\}$,

uniformly dominates $S^2/(n+1)$. Notice that $\delta^S(\bar{X}, S^2)$
converges to $S^2/(n+1)$ if $\bar{X}^2/S^2$ is big, but
shrinks the estimator toward zero if it is small. Stein's
proof was quite innovative (and is reproduced in the
review paper by Maatta and Casella, 1990). The
proof is based on looking at the conditional expectation
of the risk function, conditioning on $\bar{X}/S$,
and showing that moving the usual estimator toward
zero moves to a lower point on the quadratic risk surface.
This approach was extended by Brown (1968)
to establish inadmissibility results, and by Brewster
and Zidek (1974), who found the best scale equivariant
estimator. Minimax estimators were also found
by Strawderman (1974), using a different technique.
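
Stein's variance estimator is easy to compute from a sample. The following minimal sketch returns both Stein's estimate and the usual $S^2/(n+1)$ for comparison; the function name is hypothetical.

```python
def stein_variance_estimate(xs):
    """Return (Stein's estimate, usual estimate S^2/(n+1)) of the variance.

    S^2 is the sum of squared deviations from the sample mean, and Stein's
    estimate is min{ S^2/(n+1), (S^2 + n * xbar^2)/(n+2) }.
    """
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((v - xbar) ** 2 for v in xs)
    usual = s2 / (n + 1)
    shrunk = (s2 + n * xbar ** 2) / (n + 2)
    return min(usual, shrunk), usual
```

For the sample `[-1.0, 0.0, 1.0]`, whose mean is zero, the Stein estimate is 0.4 against the usual 0.5; for `[9.0, 10.0, 11.0]`, whose mean is far from zero relative to its spread, the two estimates coincide at 0.5.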

Turning to intervals, building on the techniques
developed by Stein and Brown, Cohen (1972) exhibited
a confidence interval for the variance that
improved on the usual confidence interval. If $(S^2/b, S^2/a)$
is the shortest $1-\alpha$ confidence interval based
on $S^2$ (Tate and Klett, 1959), Cohen (1972) considered
the confidence interval

$(S^2/b, S^2/a)\,I(\bar{X}^2/S^2 > k) + (S^2/b', S^2/a')\,I(\bar{X}^2/S^2 \le k)$,

where $I(\cdot)$ is the indicator function, $1/a - 1/b = 1/a' - 1/b'$,
so each piece has the same length, but
$1/a' < 1/a$ and $1/b' < 1/b$. So if $\bar{X}^2/S^2$ is small, the
interval is pulled toward zero, analogous to the behavior
of the Stein point estimator. Shorrock (1990)
built on this argument, and those of Brewster and Zidek
(1974), to construct a generalized Bayes confidence
interval that smoothly shifts toward zero, keeping
the same length as the usual interval but uniformly
increasing coverage probability. Building further on
these arguments, Goutis and Casella (1992) constructed
generalized Bayes intervals that smoothly
shifted the usual interval toward zero, reducing its
length but maintaining the same coverage probability.
For more recent developments on variance estimation
see Kubokawa and Srivastava (2003) and
Maruyama and Strawderman (2006).

5. CONFIDENCE INTERVALS 

In some applications there may be interest in making
inference individually for each $\theta_i$. One example
is the analysis of microarray data in which the interest
is to determine which genes are differentially expressed
(i.e., having $\theta_i$, the difference of the true expression
between the treatment group and the control
group, different from zero). Although the confidence
sets of the previous sections can be projected to
obtain confidence intervals, that will typically lead
to wider intervals than a direct construction.

If $X_i$ are independent $N(\theta_i, \sigma_i^2)$, $i = 1, \ldots, p$, the usual
one-dimensional interval is

$I_{x_i} = x_i \pm c\sigma_i$,

where $c$ is chosen so that the coverage probability is
$1-\alpha$. Hence, $c$ is the $\alpha/2$ upper quantile of a standard
normal.

5.1 Empirical Bayes Intervals 

If a frequentist criterion is used, it is not possible
to simultaneously improve on the length and coverage
probability of $I_{x_i}$ in one dimension. However, it
is possible to do so if an empirical Bayes criterion is
used. Morris (1983) defined an empirical Bayes confidence
region with respect to a class of priors $\Pi$,
having confidence coefficient $1-\alpha$, to be a set $C(X)$
satisfying

$P_\pi(\theta \in C(X)) = \int P_\theta(\theta \in C(X))\,\pi(\theta)\,d\theta \ge 1-\alpha$ for all $\pi(\theta) \in \Pi$.

Note that $P_\pi(\theta \in C(X))$ is the Bayes coverage
probability in that both $X$ and $\theta$ are integrated
out. Using normal priors with both equal and unequal
variance, Morris went on to construct $1-\alpha$
empirical Bayes confidence intervals that have average
(across $i$) squared lengths smaller than $I_{x_i}$.
Bootstrap intervals based on Morris' construction
are also proposed in Laird and Louis (1987).

In the canonical model

(13) $X_i \sim N(\theta_i, 1)$ independently and $\theta_i \sim$ i.i.d. $N(0, \tau^2)$,

He (1992) proved that there exists an interval that
dominates $I_{x_i}$. Precisely, for $\delta^+(X)$ of (6), it was
shown that there exists $a > 0$ such that the interval
$\delta_i^+(X) \pm c$ has higher Bayes coverage probability for
any $\tau^2 > 0$.

The approach He took is similar to the approach of
Casella and Hwang (1983), using a one-dimensional
loss function similar to the linear loss (10) except
that $\theta$ is replaced by only the component $\theta_i$ of interest.
As in the discussion following (10), $k$ and $c$
need to be properly linked. With such a choice of $k$,
the decision Bayes interval is then approximated by
its empirical Bayes counterpart:

$C^E_i = \{\theta_i : |\theta_i - \delta_i^+(x)|^2 \le v(|x|)\}$.

Here $\delta_i^+(X)$ is the $i$th component of the James-Stein
positive part estimator (6) with $a = p-2$,

(14) $v(|X|) = M(c^2 - \log M)$, $M = \max\left\{\left(1 - \frac{p-2}{|X|^2}\right)^+,\ \frac{1}{p-1}\right\}$.

Note the resemblance to (12). There is also a truncation
carried out in the definition of $M$ so that $v(|X|)$
is bounded away from zero.

It can be shown that the length of $C^E_i$ is always
smaller than that of $I_{x_i}$ for each individual coordinate
$i$ as long as $c > 1$, or, equivalently, $1-\alpha > 68\%$.
In contrast, in Morris (1983) only the average length
across $i$ was made smaller.



Numerical studies in He (1992) demonstrated that
his interval is an empirical Bayes confidence interval
with $1-\alpha$ confidence coefficient. Also, on average,
it has shorter length than the intervals of Morris
(1983) or Laird and Louis (1987) when $\alpha = 0.05$ or
0.1. He concluded that his interval is recommended
only if $\alpha \le 0.1$. Interestingly, in modern applications
with concerns about multiple testing, a small value
of $\alpha$ is more important.

5.2 Intervals for the Selected Mean 

An important problem in statistics is to address 
the confidence estimation problem after selecting 
a subset of populations from a larger set. This is 
especially so if the number p of populations is huge 
and the number of selected populations, k, is rela- 
tively small, a scenario typical in microarray exper- 
iments. For example, ignoring the selection and just 
estimating the parameters of the selected popula- 
tions by the sample means would have serious bias, 
especially if the populations selected are the ones 
with largest corresponding sample means. In such 
a situation, intuition would suggest that some kind 
of shrinkage approach is very much needed. 

Specifically, we consider the canonical model

(15) $X_i \sim N(\theta_i, \sigma_i^2)$ independently and $\theta_i \sim$ i.i.d. $N(\mu, \tau^2)$.

Let $\theta_{(i)}$ be the parameter of the selected population,
that is, it is the $\theta_j$ such that $X_j = X_{(i)}$, where

(16) $X_{(1)} \le X_{(2)} \le \cdots \le X_{(p)}$

are the order statistics of $(X_1, \ldots, X_p)$. In particular,
$\theta_{(p)}$ is the $\theta$ that corresponds to the largest
observation $X_{(p)} = \max_j X_j$. Note that it is not true
that $\theta_{(1)} \le \theta_{(2)} \le \cdots \le \theta_{(p)}$. In particular, $\theta_{(p)}$ is not
necessarily the largest of the $\theta_j$'s. It is just that $\theta_{(p)}$
happens to have produced the largest observation
among the $X_j$'s.

In the point estimation problem, the naive estimator
of $\theta_{(p)}$ is $X_{(p)}$, which can be intuitively seen
to be an overestimate, especially if all $\theta_i$ are equal.
A shrinkage estimator adapted to this situation would
seem more reasonable. Hwang (1993) was able to
show that for estimating $\theta_{(i)}$, a variation of the positive-part
estimator (6), with $X_i$ replaced by $X_{(i)}$,
has, for every $\mu$ and $\tau^2$, smaller Bayes risk than $X_{(i)}$
with respect to one-dimensional squared error loss.

For the construction of confidence intervals, Qiu
and Hwang (2007) adapted the approach of Casella
and Hwang (1983) and He (1992) to this problem.
For any selection, they constructed $1-\alpha$ empirical
Bayes confidence intervals for $\theta_{(i)}$ which are shown
numerically to have confidence coefficient $1-\alpha$ when
$\sigma_i = \sigma$ is either known or estimable. Moreover, the
interval is everywhere shorter than even the traditional
interval, $X_{(i)} \pm c\sigma$, which does not maintain
$1-\alpha$ coverage in this case.
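
The loss of coverage of the traditional interval after selection is easy to see in a small simulation. The sketch below draws from model (15) with $\mu = 0$ and $\sigma_i = 1$, selects the population with the largest observation, and records how often $X_{(p)} \pm c$ covers $\theta_{(p)}$. The values $p = 100$, $\tau = 0.5$ and $c = 1.96$, and the function name, are illustrative assumptions.

```python
import random

def naive_coverage_after_selection(p=100, tau=0.5, c=1.96,
                                   reps=2000, seed=0):
    """Monte Carlo coverage of X_(p) +/- c for theta_(p) under model (15).

    Assumes mu = 0 and sigma_i = 1; the selected population is the one
    with the largest observation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        thetas = [rng.gauss(0.0, tau) for _ in range(p)]
        xs = [t + rng.gauss(0.0, 1.0) for t in thetas]
        j = max(range(p), key=lambda i: xs[i])  # index of the largest X
        hits += abs(xs[j] - thetas[j]) <= c
    return hits / reps
```

Setting `p=1` removes the selection step and recovers roughly the nominal 95% coverage, while larger `p` drives the coverage of the naive interval far below the nominal level.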

Interestingly, in one microarray data set, Qiu and
Hwang (2007) found that the normal prior did not
fit the data as well as a mixture of a normal prior
and a point mass at zero. For the mixture prior,
an empirical Bayes confidence interval for $\theta_{(i)}$ was
constructed and shown (numerically and asymptotically
as $p \to \infty$) to have empirical Bayes confidence
coefficient at least $1-\alpha$.

Further, combining $k$ empirical Bayes $1-\alpha/k$ confidence
intervals for $\theta_{(i)}$, $i \in S$, where $S$ consists of $k$
indices of the selected $\theta_{(i)}$'s, yields a simultaneous
confidence set (rectangle) that has empirical Bayes
coverage probability above the nominal $1-\alpha$ level.
Furthermore, their sizes could be much smaller than
even the naive rectangles (which ignore selection and
hence have poor coverage). This can also lead to
a more powerful test.

5.3 Shrinking Means and Variances 

Thus far, we have only discussed procedures that
shrink the sample means. However, confidence sets
can also be improved by shrinking variances. In Section 4
we saw how to construct improved intervals
for the variance. In Berry (1994) it was shown that
using an improved variance estimator can slightly
improve the risk of the Stein point estimator (but
not the positive-part). Now we will see that we can
substantially improve intervals for the mean by using
improved variance estimates, when there are
a large number of variances involved.

Hwang, Qiu and Zhao (2009) constructed empirical
Bayes confidence intervals for $\theta_i$ where the center
and the length of the interval are found by shrinking
both the sample means and sample variances.
They took an approach similar to He (1992), except
that the task is complicated by putting yet
another prior on $\sigma_i^2$. The prior assumption is that
$\log \sigma_i^2$ is distributed according to a normal distribution
(or $\sigma_i^2$ has an inverted gamma distribution). In
both cases, their proposed double shrinkage confidence
interval maintains empirical Bayes coverage
probabilities above the nominal level, while the expected
length is always smaller than that of the $t$-interval
or the interval that only shrinks means. Simulations
show that the improvements could be up to 50%.

The confidence intervals constructed are shown to
have empirical Bayes confidence coefficient close to
$1-\alpha$. In all the numerical studies, including extensive
simulation and the application to the data
sets, the double shrinkage procedure performed better
than the single shrinkage intervals (intervals that
shrink only one of the sample means or sample variances
but not both) and the standard $t$ interval
(where there is no shrinkage).

6. DISCUSSION 

The confidence sets that we have discussed broadly 
fall into two categories: those that are explicitly de- 
fined by a center and a radius (such as Berger, 1980, 
or Casella and Hwang, 1983), and those that are 
implicit (such as Tseng and Brown, 1997). For ex- 
perimenters, the explicitly defined intervals may be 
slightly preferred. 

The improved confidence sets typically work because
they are able to reduce the volume of the
x-section (the confidence set) without reducing the
volume of the $\theta$-section (the acceptance region). As
the coverage probability results from the $\theta$-section,
the result is an improved set in terms of volume and
coverage.

Another point to note is that most of the sets
presented are based on shrinking toward zero. Moreover,
the improved sets will typically have greatest
coverage improvement near zero, that is, near the
point to which they are shrinking. The point zero
is, of course, only a convenience, as we can shrink
toward any point $\mu_0$ by translating the problem to
$x - \mu_0$ and $\theta - \mu_0$, and then obtain the greatest
confidence improvement when $x - \mu_0$ is small. Moreover,
we can shrink toward any linear subset of the
parameter space, for example, the space where the
coordinates are all equal, by translating to $x - \bar{x}1$ and
$\theta - \bar{\theta}1$, where $1$ is a vector of 1s. This is developed
in Casella and Hwang (1987).
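
The recentering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample data, and the constants passed as `a` (here $p - 2$ for a fixed point and $p - 3$ for the grand mean, a common convention) are choices made for the example and are not taken from the papers cited above.

```python
def positive_part_shrink(x, target, a):
    """Shrink x toward `target` using the factor (1 - a/|x - target|^2)^+."""
    diff = [xi - ti for xi, ti in zip(x, target)]
    sq_norm = sum(d * d for d in diff)
    if sq_norm == 0.0:
        return list(target)
    factor = max(0.0, 1.0 - a / sq_norm)
    return [ti + factor * d for ti, d in zip(target, diff)]


def shrink_toward_point(x, mu0, a):
    """Translate to x - mu0, shrink toward zero, translate back."""
    return positive_part_shrink(x, mu0, a)


def shrink_toward_mean(x, a):
    """Shrink toward the subspace where all coordinates are equal."""
    xbar = sum(x) / len(x)
    return positive_part_shrink(x, [xbar] * len(x), a)


# Hypothetical data, used only to exercise the functions.
x = [2.0, 3.5, 1.0, 4.5, 3.0]
p = len(x)
center_point = shrink_toward_point(x, [3.0] * p, a=p - 2)
center_mean = shrink_toward_mean(x, a=p - 3)
```

Either center can then be used in place of the zero-shrinking center; the shrinkage is strongest when the observation is close to the chosen target.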

The Stein effect, which was discovered in point
estimation, has had far-reaching influence in
confidence set estimation. It has shown us that by
taking into account the structure of a problem, possibly
through an empirical Bayes model, improved point
and set estimators can be constructed.



APPENDIX A: PROOF OF DOMINANCE OF $C^+$



Hwang and Casella (1982) show that $P_\theta(\theta \in C^+)$
is decreasing in $|\theta|$, and hence has minimum $1 - \alpha$ at
$|\theta| = \infty$. The proof is somewhat complex, and only
holds for $p \ge 4$. Hwang and Casella (1984) found
a simpler approach, which extended the result to
$p = 3$. We outline that approach here.

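For the set $C^+ = \{\theta : |\theta - \delta^+(x)| \le c\}$, the
following lemma shows that we do not have to worry about
$|\theta| \le c$.

Lemma A.1. For $X \sim N(\theta, I)$ and every $a > 0$
and $|\theta| \le c$,

$$P_\theta(\theta : |\theta - \delta^+(X)| \le c) \ge P_\theta(\theta : |\theta - X| \le c).$$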

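Proof. The assumption $|\theta| \le c$ implies that $0 \in C_\theta^0$,
the $\theta$-section (acceptance region) of the usual
procedure. Therefore, by the convexity of $C_\theta^0$,

$$x \in C_\theta^0 \Rightarrow \delta^+(x) \in C_\theta^0,$$

since $\delta^+(x)$ is a convex combination of $0$ and $x$.
Finally, since $\delta^+(x) \in C_\theta^0$, we then have
$|\delta^+(x) - \theta| \le c$, so $C_\theta^0 \subset C_\theta^+$ and the lemma is proved. □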

It is interesting that, even though the confidence
sets (the x-sections) have exactly the same volume,
for small $\theta$ the $\theta$-section of the $\delta^+$ procedure
contains the $\theta$-section of the usual procedure.
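
The containment in the lemma can be checked numerically. The sketch below assumes the positive-part estimator has the form $(1 - a/|x|^2)^+ x$ and uses hypothetical values for the dimension, the constant `a`, the radius `c` (the square root of 7.779, approximately the upper 10% point of a chi-squared distribution with 4 degrees of freedom) and the parameter `theta`. It draws the same observations for both procedures and counts any draw that the usual set covers but the recentered set misses.

```python
import math
import random


def delta_plus(x, a):
    """Positive-part estimator (1 - a/|x|^2)^+ x (assumed form)."""
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0.0:
        return [0.0] * len(x)
    factor = max(0.0, 1.0 - a / sq_norm)
    return [factor * v for v in x]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def coverage_counts(theta, c, a, n_draws, seed=0):
    """Count coverage of both sets on common draws X ~ N(theta, I)."""
    rng = random.Random(seed)
    usual = plus = violations = 0
    for _ in range(n_draws):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        in_usual = dist(theta, x) <= c
        in_plus = dist(theta, delta_plus(x, a)) <= c
        usual += in_usual
        plus += in_plus
        # A violation is a draw covered by the usual set only.
        violations += in_usual and not in_plus
    return usual, plus, violations


theta = [1.0, 1.0, 0.5, 0.0]   # |theta| = 1.5, which is below c
c = math.sqrt(7.779)
usual, plus, violations = coverage_counts(theta, c, a=2.0, n_draws=20000)
```

When $|\theta| \le c$ the violation count should be zero, so the coverage count of the recentered set is never below that of the usual set on the same draws.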

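In addition to not needing to worry about $|\theta| \le c$,
there is a further simplification if $|\theta| > c$. If $|\theta| > c$,
the inequality $|\theta - \delta^+(x)| \le c$ is equivalent to

$$|\theta - \delta^{JS}(x)| \le c \quad \text{and} \quad |x|^2 > a,$$

where $\delta^{JS}$ is the estimator without the positive-part
truncation, which allows us to drop the "+." Note that if
$|\theta| > c$ and $|x|^2 < a$, then $\delta^+(x) = 0$ and so
$|\theta - \delta^+(x)| = |\theta| > c$.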

Last, we note that if $a = 0$, then the two procedures
are exactly the same and, thus, a sufficient
condition for domination of $C^0$ by $C^+$ is to show
that

(A.1)    $\frac{d}{da} P_\theta(\theta \in C^+) \ge 0$

for every $|\theta| > c$ and $a$ in an interval including 0.
The inequality (A.1) was established in Hwang and
Casella (1984) through the use of the polar
transformation $(x, \theta) \to (r, \beta)$, where $r = |x|$ and
$x'\theta = |x||\theta|\cos(\beta)$, so $\beta$ is the angle between $x$ and $\theta$. The
polar representation of the coverage probability is
differentiable in $a$, and the following theorem was
established.

Theorem A.2. For $p \ge 3$, the coverage probability
of $C^+$ is higher than that of $C^0$ for every $\theta$
provided $0 < a \le a^*$, where $a^*$ is the unique solution
to



$$\left[\frac{c + (c^2 + a)^{1/2}}{\sqrt{a}}\right]^{p-2} e^{-c\sqrt{a}} = 1.$$



Solutions to this equation are easily computed,
and it turns out that $a^* \approx 0.8(p - 2)$, which does
not quite get to the value $p - 2$, the optimal value
for $\delta^{JS}$ and the popular choice for $\delta^+$. However, the
coverage probabilities are very close. Moreover, the
theorem provides a sufficient condition, and it is no
doubt the case that $a = p - 2$ achieves dominance.
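
As a rough sketch of how such a solution might be computed, the Python code below applies bisection on the log scale. It assumes the defining equation has the form $[(c + (c^2 + a)^{1/2})/\sqrt{a}]^{p-2} e^{-c\sqrt{a}} = 1$, which is a reconstruction and should be checked against Hwang and Casella (1984). The function names and the example values of `p` and `c` are hypothetical.

```python
import math


def log_lhs(a, p, c):
    """Log of [(c + sqrt(c^2 + a)) / sqrt(a)]^(p-2) * exp(-c*sqrt(a))."""
    root_a = math.sqrt(a)
    ratio_log = math.log(c + math.sqrt(c * c + a)) - math.log(root_a)
    return (p - 2) * ratio_log - c * root_a


def solve_a_star(p, c, tol=1e-10):
    """Bisection for the root of log_lhs, which is decreasing in a."""
    lo, hi = 1e-12, 1.0
    # Expand the bracket until the function changes sign.
    while log_lhs(hi, p, c) > 0.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_lhs(mid, p, c) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p = 10
c = math.sqrt(15.987)  # approx. upper 10% point of chi-squared, 10 df
a_star = solve_a_star(p, c)
ratio = a_star / (p - 2)
```

Working on the log scale avoids overflow for large $p$, and because the left-hand side is monotone in $a$ the bisection bracket contains exactly one root.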

APPENDIX B: THE STRAWDERMAN PRIOR 

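The first proper Bayes minimax point estimators
were found by Strawderman (1971) using a
hierarchical prior of the form

$$X \mid \theta \sim N_p(\theta, I), \qquad \theta \mid \lambda \sim N_p\left(0, \frac{1 - \lambda}{\lambda} I\right),$$
$$\lambda \sim (1 - a)\lambda^{-a}, \qquad 0 < \lambda < 1, \quad 0 \le a < 1.$$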

The Bayes estimator for this model is $E(\theta \mid x) =
[1 - E(\lambda \mid x)]x$. The function $|x|^2 E(\lambda \mid x)$ is a bounded
increasing function of $|x|$, and Strawderman was able
to show, using an extension of Baranchik's (1970)
result, that for $p \ge 5$ the Bayes estimator is
minimax. An interesting point about this hierarchy is
that the unconditional prior on $\theta$ is approximately
$1/|\theta|^{p+2-2a}$, giving it t-like tails. These are the types
of priors that lead to Bayesian posterior credible sets
with good coverage probabilities.
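
The shrinkage factor in this Bayes estimator can be evaluated by one-dimensional numerical integration. The sketch below assumes the hierarchy $\theta \mid \lambda \sim N_p(0, \lambda^{-1}(1 - \lambda) I)$ with $\lambda \sim (1 - a)\lambda^{-a}$, under which $X \mid \lambda \sim N_p(0, \lambda^{-1} I)$ and the posterior density of $\lambda$ is proportional to $\lambda^{p/2 - a} e^{-\lambda |x|^2 / 2}$ on $(0, 1)$. The function names, the midpoint rule and the grid size are choices made for the illustration.

```python
import math


def posterior_mean_lambda(sq_norm, p, a, n_grid=20000):
    """E(lambda | x) by the midpoint rule, where sq_norm = |x|^2.

    Assumes the posterior density of lambda is proportional to
    lambda^(p/2 - a) * exp(-lambda * |x|^2 / 2) on (0, 1).
    """
    k = p / 2.0 - a
    h = 1.0 / n_grid
    num = den = 0.0
    for i in range(n_grid):
        lam = (i + 0.5) * h
        w = lam ** k * math.exp(-0.5 * lam * sq_norm)
        num += lam * w
        den += w
    return num / den


def strawderman_estimate(x, a):
    """Bayes estimator [1 - E(lambda | x)] x under the assumed hierarchy."""
    p = len(x)
    sq_norm = sum(v * v for v in x)
    shrink = posterior_mean_lambda(sq_norm, p, a)
    return [(1.0 - shrink) * v for v in x]


# r(s) = s * E(lambda | x) at a few values of s = |x|^2, for p = 5, a = 0.5.
r_values = [s * posterior_mean_lambda(s, 5, 0.5) for s in (1.0, 4.0, 9.0, 16.0, 25.0)]
estimate = strawderman_estimate([1.0, -2.0, 0.5, 3.0, 1.5], a=0.5)
```

In this sketch the product $|x|^2 E(\lambda \mid x)$ increases with $|x|$ and stays below $p + 2 - 2a$, while $E(\lambda \mid x)$ itself decreases, so large observations are shrunk proportionally less.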

Faith (1978) used a similar hierarchical model with
$\theta \sim N(0, \tau^2 I)$ and $\tau^2 \sim$ Inverted Gamma$(a, b)$,
leading to an unconditional prior on $\theta$ of the form
$\pi(\theta) \propto (2b + |\theta|^2)^{-(p/2 + a)}$, the multivariate t distribution.
In his unpublished Ph.D. thesis, Faith gave strong
evidence that the Bayesian posterior credible sets had
good coverage properties.

Berger (1980) used a generalization of Strawder- 
man's prior, which is more tractable than the t prior 
of Faith, to allow for input on the covariance struc- 
ture. 